1 Goals

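In this in-class lab, we will explore several ways to go beyond default ggplot output: themes, color schemes, heatmaps, pair plots, ridgeline plots, side-by-side plots, and interactive plots.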

To guide you through this material, I will provide examples using the gapminder data that we worked with last week. Then you will work through similar problems using the perceptions data. Put briefly, the perceptions data records how people interpret different words and phrases relating to probabilities and numbers. The raw data came from /r/samplesize responses to the following question: What [probability/number] would you assign to the phrase “[phrase]”? The exercises below refer to its two parts by name: the probly data (probabilities) and the numberly data (numbers). You can read more about the perceptions data at https://github.com/zonination/perceptions.

2 Themes and Color Schemes

As I mentioned in the slides, it can become monotonous to look at 100+ plots with the same gray gridded ggplot background and the same default ggplot color scheme. A simple way to mix things up is to apply a different built-in ggplot theme, such as theme_bw(), theme_minimal(), or theme_classic(). Pick your favorite, or simply google “custom ggplot themes” for a plethora of options.

You can play around with the built-in ggplot themes or even build a custom theme if you’re feeling ambitious.
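As a rough illustration of swapping themes, here is a minimal sketch using the gapminder data. The choice of summary (mean life expectancy per continent in 2007), the object names, and the choice of theme_minimal() are my own assumptions for the example; any built-in theme can be substituted.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Mean life expectancy per continent in 2007 (an arbitrary example summary)
life_by_continent <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(mean_life_exp = mean(lifeExp), .groups = "drop")

p_theme <- ggplot(life_by_continent, aes(x = continent, y = mean_life_exp)) +
  geom_col() +
  labs(x = "Continent", y = "Mean life expectancy in 2007 (years)") +
  # Replace the default gray theme with a built-in alternative
  theme_minimal() +
  # Rotate the x-axis labels so that longer category names stay readable
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_theme
```

Note that the theme(...) tweak comes after theme_minimal(); a complete theme added later would overwrite the rotated labels.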

  1. Compute the mean probability associated with each phrase in the probly data and plot the results in a bar graph. Add a built-in or custom ggplot theme to this plot. In your plot, make sure you can read the x-axis/phrase labels.

3 Color Schemes

Choosing an appropriate color scheme can drastically improve the readability of your plots. Sometimes, it is worthwhile to stray from the default colors in ggplot. The viridis package, in particular, has several nice continuous color schemes that can be easily applied to ggplot objects (see scale_color_viridis() or scale_fill_viridis()). Aside from viridis, the RColorBrewer package has some nice color palettes, and if you’re incredibly ambitious, you can even create your own color palettes.
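The sketch below shows one way these color scales might be attached to gapminder plots. The particular plots, the "plasma" option, and the "Set2" palette are illustrative assumptions rather than recommendations; scale_fill_brewer() is the ggplot2 scale that uses the RColorBrewer palettes.

```r
library(ggplot2)
library(gapminder)
library(viridis)

gap_2007 <- subset(gapminder, year == 2007)

# Continuous color scale: color points by log10(population) using viridis
p_viridis <- ggplot(gap_2007,
                    aes(x = gdpPercap, y = lifeExp, color = log10(pop))) +
  geom_point() +
  scale_x_log10() +
  scale_color_viridis(option = "plasma", name = "log10(population)") +
  labs(x = "GDP per capita (log scale)", y = "Life expectancy (years)")

# Discrete color scale: fill boxplots by continent using a ColorBrewer palette
p_brewer <- ggplot(gap_2007, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Continent", y = "Life expectancy (years)")

p_viridis
p_brewer
```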

  1. Using the numberly data, make a boxplot to visualize the distribution of assigned numbers corresponding to each phrase (i.e., there should be a boxplot for each phrase). Due to the highly skewed distributions, please take the \(\log_{10}\)-transform of the numeric data. Then, add an additional layer to the plot using geom_jitter() to explicitly plot the assigned numbers for each phrase. Finally, add an appropriate (non-default) color scheme to your plot.

4 Heatmaps

Besides scatterplots, another type of graph that is often very informative is a heatmap. Rebecca Barter, a student of Bin and a former 215A GSI, developed the superheat package, which can be used to make nice heatmaps. In the accompanying figure, we plot a heatmap of life expectancy over time for various countries. Note that the countries have been clustered via hierarchical clustering with Ward’s linkage (don’t worry if you have no idea what this means yet). This allows us to more easily see patterns in the life expectancies across different groups of countries. Note the difference between the clustered heatmap and the heatmap without clustering. In the clustered heatmap, some of the clusters correspond to coherent geographical regions, which makes intuitive sense.
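A heatmap like the one described above could be produced roughly as follows. This is only a sketch: it does the Ward clustering with base R's hclust() and passes the resulting row order to superheat, which may differ from how the original figure was made. The superheat argument names used here (order.rows, left.label.text.size, bottom.label.text.angle) reflect my reading of the superheat documentation, so double-check them against ?superheat.

```r
library(gapminder)
library(superheat)

# Reshape to a country-by-year matrix of life expectancy
# (each country-year cell holds exactly one value, so mean() just returns it)
life_mat <- with(gapminder, tapply(lifeExp, list(country, year), mean))

# Hierarchical clustering of countries with Ward's linkage (base R)
country_clust <- hclust(dist(life_mat), method = "ward.D2")
row_order <- country_clust$order

# Heatmap without clustering: rows appear in their original order
superheat(life_mat,
          left.label.text.size = 1.5,
          bottom.label.text.angle = 90)

# Clustered heatmap: rows reordered so that similar countries are adjacent
superheat(life_mat,
          order.rows = row_order,
          left.label.text.size = 1.5,
          bottom.label.text.angle = 90)
```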

  1. Use superheat to create a heatmap for the probly data, where the samples are on the x-axis, phrases on the y-axis, and probabilities represented by the heatmap values/colors. Cluster the samples using hierarchical clustering with complete linkage but don’t plot the resulting dendrogram. To the right of the heatmap, plot a bar graph with the average probabilities for each phrase.

5 Pair Plots

Last week, we looked at how life expectancy changes over time and at the relationship between GDP and life expectancy, but we did so separately. It may be informative to look at multiple pair-wise relationships in the data in a single plot. In the GGally package, the ggpairs() function allows us to plot a matrix of pair plots, showcasing many different pair-wise relationships in the dataset at the same time.
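For instance, a pair plot of three continuous gapminder variables might look like the sketch below. The choice of variables and the log10 transform of GDP per capita are my own assumptions for illustration.

```r
library(ggplot2)
library(gapminder)
library(GGally)

# Log-transform GDP per capita, which is heavily right-skewed
gap_pairs <- transform(gapminder, log10_gdpPercap = log10(gdpPercap))

# Columns to include in the pair plot matrix
pair_vars <- c("year", "lifeExp", "log10_gdpPercap")

# Scatterplots below the diagonal, densities on it, correlations above it
p_pairs <- ggpairs(gap_pairs, columns = pair_vars)

p_pairs
```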

  1. Going beyond pair-wise comparisons between continuous variables, ggpairs() also allows for discrete variables in the pair plots. Create the same type of pair plot as above, but include the following four variables from gapminder: population, continent, life expectancy, and GDP per capita. Also, add a theme to your pair plot.

There are a lot of additional options that you can set in the ggpairs() function, so check out the help page for ggpairs. This help page is very informative, and you can do a lot with the ggpairs function.

6 Ridgeline Plots/Joyplots

The ggridges package provides an additional geom, geom_density_ridges(), that can be added to ggplot objects. geom_density_ridges() arranges multiple density plots, one per group, in a staggered fashion along the y-axis.
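Here is a minimal sketch of a ridgeline plot with the gapminder data, showing the distribution of life expectancy within each continent. The variables, transparency level, and use of theme_ridges() are illustrative assumptions.

```r
library(ggplot2)
library(gapminder)
library(ggridges)

# One density ridge per continent; the grouping variable goes on the y-axis
p_ridges <- ggplot(gapminder,
                   aes(x = lifeExp, y = continent, fill = continent)) +
  geom_density_ridges(alpha = 0.7) +
  labs(x = "Life expectancy (years)", y = "Continent") +
  # theme_ridges() is a theme shipped with ggridges for this kind of plot
  theme_ridges()

p_ridges
```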

  1. Use geom_density_ridges to make a nice visualization of the numberly data. Make sure to take care of skewness and add an appropriate (non-default) color scheme. Please also remove the legend.

7 Side-by-side Plots

Sometimes, it may be useful to organize multiple plots side-by-side. Two functions for doing so are gridExtra::grid.arrange() and ggpubr::ggarrange().
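As a basic illustration, both functions can place two ggplot objects next to each other. The two histograms below are placeholder plots that I chose for the example; any ggplot objects would work.

```r
library(ggplot2)
library(gapminder)

# Two simple placeholder plots to arrange
p_life <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30) +
  labs(x = "Life expectancy (years)")

p_gdp <- ggplot(gapminder, aes(x = gdpPercap)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)")

# gridExtra: draws the arranged plots directly on the current device
gridExtra::grid.arrange(p_life, p_gdp, ncol = 2)

# ggpubr: returns an object that can be printed or saved like a ggplot
side_by_side <- ggpubr::ggarrange(p_life, p_gdp, ncol = 2)
side_by_side
```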

  1. One of the advantages of using ggarrange over grid.arrange is the ease of creating a common legend and subplot labels with ggarrange. Below, create two scatterplots using the gapminder data: the first scatterplot showing population vs. life expectancy and the second scatterplot showing GDP per capita vs. life expectancy. Color the points in both scatterplots by continent and place these scatterplots side-by-side using ggarrange. Since both scatterplots have the same color legend, use a common legend and place it below the plots (legend = "bottom"). Also, set the labels argument to “AUTO”.

  1. Unlike ggarrange, grid.arrange allows for very flexible plotting layouts. In addition to the two subplots that you created in the previous exercise, create a third plot using geom_bar which shows the number of data points (i.e., countries) from each continent. Use grid.arrange to create a plot with the population vs. life expectancy plot in the upper left quadrant, the GDP per capita vs. life expectancy plot in the upper right quadrant, and the bar plot spanning the lower two quadrants. The end result should be similar to the plot shown in the gridExtra::grid.arrange slide from class.

8 Interactive Plots

Challenge Exercise: Come up with your own interactive visualization of the perceptions data. Be creative!